Fix SDE CI job hanging indefinitely after tests complete#31

Merged
Malkovsky merged 1 commit into main from fix/sde-job-hang
Feb 24, 2026

Conversation

@kilo-code-bot kilo-code-bot bot commented Feb 24, 2026

Problem

The build-and-test-with-SDE job sometimes hangs indefinitely after all tests pass. The tests themselves complete successfully, but the SDE process never exits, so the job blocks until GitHub Actions' default 6-hour job timeout cancels it.

Root Cause

Intel SDE is built on the PIN binary instrumentation framework. On Linux (including GitHub Actions runners), SDE can deadlock during process teardown when:

  1. AddressSanitizer (-fsanitize=address) installs atexit() handlers for leak-detection cleanup
  2. Non-trivial static/global destructors exist in the binary (e.g. the LUT8Tables Meyers singleton in rmm_tree.h, __m256i file-scope statics in bits.h)
  3. PIN's internal VM shutdown races with the ASan teardown sequence on Linux VMs

This is a known issue reported in Intel's community forums (PIN NotifyExit: assertion failed: _initialized).

Fix

Two layers of protection:

  • timeout-minutes: 60 at the job level: a hard backstop so the job never consumes the full 6-hour default.
  • timeout 1800 around each SDE step: kills SDE if it has not exited within 30 minutes. Each test binary is also run with --gtest_output=xml. If timeout fires (exit code 124), the step is treated as success only when the XML report contains failures="0", which means all tests passed and SDE hung during teardown rather than being killed mid-test. Any other non-zero exit code still fails the step.
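
The second layer can be sketched as a small shell helper. This is an illustration of the logic described above, not the workflow's actual step: the function name run_guarded, the report path, and the test binary name in the usage comment are hypothetical, and the grep pattern assumes the failures attribute sits on the root <testsuites> element of the GTest XML report.

```shell
#!/usr/bin/env bash
# Hypothetical helper (name and arguments are assumptions, not from the PR):
# run a command under a time limit and forgive a timeout only when the
# GTest XML report shows zero failures.
run_guarded() {
  local limit="$1" report="$2"
  shift 2
  rm -f "$report"                 # never trust a stale report from an earlier run
  local rc=0
  timeout "$limit" "$@" || rc=$?  # coreutils timeout exits with 124 on expiry
  if [ "$rc" -eq 124 ] && [ -f "$report" ] \
     && grep -Eq '<testsuites[^>]* failures="0"' "$report"; then
    echo "timed out after tests passed; treating teardown hang as success" >&2
    return 0
  fi
  return "$rc"                    # real failures and mid-test timeouts propagate
}

# Possible use in a workflow step (binary names are placeholders):
#   run_guarded 1800 report.xml sde -- ./unit_tests --gtest_output=xml:report.xml
```

Deleting the old report before the run matters: without it, a report left over from a previous step could make a mid-test kill look like a clean pass.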

SDE (built on Intel PIN) can hang during process teardown on Linux/GitHub
Actions when ASan atexit handlers and non-trivial static destructors run.
This causes the build-and-test-with-SDE job to block until the 6-hour
GitHub Actions default timeout.

Fix:
- Add timeout-minutes: 60 at the job level as a hard backstop
- Wrap each SDE invocation with timeout 1800 and pass
  --gtest_output=xml so the GTest XML report can be inspected
- If timeout fires (rc=124) but the XML confirms failures="0",
  the step is treated as success — tests passed, SDE merely hung
  on teardown

- name: Build Project
  working-directory: ./build
  run: make -j

SUGGESTION: Use dynamic parallelism for faster builds

make -j without a number places no limit on the number of parallel jobs, which can oversubscribe the runner's CPU and memory and slow the build. Consider make -j$(nproc) (or cmake --build . -j$(nproc)) to match the available cores for more consistent performance.
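
A minimal sketch of that suggestion, assuming a POSIX-like runner; the helper name build_jobs is hypothetical, and the fallbacks only cover systems where nproc is missing.

```shell
#!/usr/bin/env bash
# Hypothetical helper: choose a parallel job count from the visible CPU count,
# falling back to getconf and then to 1 if nproc is unavailable.
build_jobs() {
  local n
  n="$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)"
  echo "$n"
}

# In the Build Project step this could replace the bare "make -j":
#   make -j"$(build_jobs)"
```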

kilo-code-bot bot commented Feb 24, 2026

Code Review Summary

Status: 1 Issue Found | Recommendation: Address before merge

Overview

Severity    Count
CRITICAL    0
WARNING     0
SUGGESTION  1

Issue Details

SUGGESTION

File                              Line  Issue
.github/workflows/build-test.yml  66    Use dynamic parallelism (e.g., -j$(nproc)) to avoid oversubscription and improve build performance.

Files Reviewed (1 file)
  • .github/workflows/build-test.yml - 1 issue

@codecov-commenter
Welcome to Codecov 🎉

Once you merge this PR into your default branch, you're all set! Codecov will compare coverage reports and display results in all future pull requests.

Thanks for integrating Codecov - We've got you covered ☂️

@Malkovsky Malkovsky merged commit 0e1ee80 into main Feb 24, 2026
4 checks passed